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Abstract 

Motivated by some applications in signal processing and machine learning, we consider two 
convex optimization problems where, given a cone K, a norm || • || and a smooth convex function 
/, we want either 1) to minimize the norm over the intersection of the cone and a level set of /, 
or 2) to minimize over the cone the sum of / and a multiple of the norm. We focus on the case 
where (a) the dimension of the problem is too large to allow for interior point algorithms, (b) 
II • II is "too complicated" to allow for computationally cheap Bregman projections required in 
the first-order proximal gradient algorithms. On the other hand, we assume that it is relatively 
easy to minimize linear forms over the intersection of K and the unit || • ||-ball. Motivating 
examples are given by the nuclear norm with K being the entire space of matrices, or the 
positive semidefinite cone in the space of symmetric matrices, and the Total Variation norm on 
the space of 2D images. We discuss versions of the Conditional Gradient algorithm capable to 
handle our problems of interest, provide the related theoretical efficiency estimates and outline 
some applications. 



1 Introduction 

We consider two norm-regularized convex optimization problems as follows: 

[norm minimization] min ||a;||, subject to /(x) < 6, (1) 
[penalized minimization] min f{x) + k||x|| (2) 

where / is a convex function with Lipschitz continuous gradient, X is a closed convex cone in a 
Euclidean space E, \\ ■ \\ is some norm, 5 and k are positive parameters. Problems such as such as 
Q and ^ are of definite interest for signal processing and machine learning. In these applications, 
/(x) quantifies the discrepancy between the observed noisy output of some parametric model and 
the output of the model with candidate vector x of parameters. Most notably, / is the quadratic 
penalty: f{x) = ^||^x — y|||, where Ax is the "true" output of the linear regression model x i— )• Ax, 
and y = Ax^, + ^, where x^, is the vector of true parameters, is the observation error, and 6 is 
an a prion upper bound on i||^|||. The cone K sums up a priori information on the parameter 
vectors (e.g., K = E ~ no a priori information at all, or E = R^, K = R^, or E = S^, the space of 
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symmetric p x p matrices, and K = S^, the cone of positive semidefinite matrices, as is the case of 
covariance matrices recovery). Finally, || • || is a regularizing norm "promoting" a desired property 
of the recovery, e.g., the sparsity- promoting norm on £^ = R", or the low rank promoting nuclear 
norm on £^ = R^^"^, or the Total Variation (TV) norm, as in image reconstruction. 

In the large-scale case, first-order algorithms of proximal-gradient type are popular to tackle such 
problems, see |^ for a recent overview. Among them, the celebrated Nesterov optimal gradient 
methods for smooth and composite minimization |22^ \23[ [M] ; and their stochastic approximation 
counterparts |18] . are now state-of-the-art in compressive sensing and machine learning. These 
algorithms enjoy the best known so far theoretical estimates (and in some cases, these estimates 
are the best possible for the first-order algorithms). For instance, Nesterov's algorithm for penalized 
minimization [23^ I24j solves ^ to accuracy e in 0{Dq y^L/e) iterations, where L is the properly 
defined Lipschitz constant of the gradient of /, and Dq is the initial distance to the optimal set, 
measured in the norm || • ||. However, applicability and efficiency of proximal-gradient algorithms in 
the large-scale case require from the problem to possess "favorable geometry" (for details, see 
Section A.6]). To be more specific, consider proximal-gradient algorithm for convex minimization 
problems of the form 

min {/(x) : ||2;|| < 1, a; G K} . (3) 

X 

The comments to follow, with slight modifications, are applicable to problems such as ([T]) and ([2]) 
as well. In this case, a proximal-gradient algorithm operates with a "distance generating function" 
(d.g.f.) defined on the domain of the problem and 1-strongly convex w.r.t. the norm || • ||. Each 
step of the algorithm requires minimizing the sum of the d.g.f. and a linear form. The efficiency 
estimate of the algorithm depends on the variation of the d.g.f. on the domain and on regularity 
of / w.r.t. II • II As a result, in order for a proximal-gradient algorithm to be practical in the 
large scale case, two "favorable geometry" conditions should be met: (a) the outlined sub-problems 
should be easy to solve, and (b) the variation of the d.g.f. on the domain of the problem should 
grow slowly (if at all) with problem's dimension. Both these conditions indeed are met in many 
applications; see, e.g., [21 [T7j for examples. This explains the recent popularity of this family of 
algorithms. 

However, sometimes conditions (a) and/or (b) are violated, and application of proximal algo- 
rithms becomes questionable. For example, for the case oi K = E, (b) is violated for the usual 
II • lloo-iiorm on R^ or, more generally, for || • ||2,i norm on the space of p x q matrices given by 

||x||2i = max ||Row,-(x) II2, 

where RowJ(x) denotes the j-th row of x. Here the variation of (any) d.g.f. on problem's domain 
is at least p. As a result, in the case in question the theoretical iteration complexity of a proximal 
algorithm grows rapidly with the dimension p. Furthermore, for some high-dimensional problems 
which do satisfy (b), solving the sub-problem can be computationally challenging. Examples of 
such problems include nuclear-norm-based matrix completion, Total Variation-based image recon- 
struction, and multi-task learning with a large number of tasks and features. This corresponds to 
II • II in ([1]) or ([2]) being the nuclear norm |10] or the TV- norm. 

^i.e., the Lipschitz constant of / w.r.t. || • || in the nonsmooth case, or the Lipschitz constant of the gradient 
mapping x 1— >■ f'{x) w.r.t. the norm || ■ || on the argument and the conjugate of this norm on the image spaces in the 
smooth case. 
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These limitations recently motivated alternative approaches, which do not rely upon favorable 
geometry of the problem domain and/or do not require to solve hard sub-problems at each iteration, 
and triggered a renewed interest in the Conditional Gradient (CndG) algorithm. This algorithm, 
also known as the Frank- Wolfe algorithm, which is historically the first method for smooth con- 
strained convex optimization, originates from [8], and was extensively studied in the 70-s (see, 
e.g., 13 El [23] and references therein). CndG algorithms work by minimizing a linear form on the 
problem domain at each iteration; this auxiliary problem clearly is easier, and in many cases - 
significantly easier than the auxiliary problem arising in proximal- gradient algorithms. Conditional 
gradient algorithms for collaborative filtering were studied recently |15| I14| . some variants and 
extensions were studied in [6l [29l [TO] . Those works consider constrained formulations of machine 
learning or signal processing problems, i.e., minimizing the discrepancy f{x) under a constraint 
on the norm of the solution, as in ([s]). On the other hand, CndG algorithms for other learning 
formulations, such as norm minimization ([T]) or penalized minimization ^ remain open issues. 
An exception is the work of [6l [10] , where a Conditional Gradient algorithm for penalized mini- 
mization was studied, although the efficiency estimates obtained in that paper were suboptimal. In 
this paper, we present CndG-type algorithms aimed at solving norm minimization and penalized 
norm minimization problems and provide theoretical efficiency guarantees for these algorithms. 

The main body of the paper is organized as follows. In Section [2j we present detailed setting of 
problems Q, Q along with basic assumptions on the "computational environment" required by 
the CndG-based algorithms we are developing. These algorithms and their efficiency bounds are 
presented in Sections [3] (problem ([T])) and [s] (problem Q. In Section[6]we outline some applications, 
and in Section [7| present preliminary numerical results. All proofs are relegated to the appendix. 

2 Problem statement 

Throughout the paper, we shall assume that K C E is a closed convex cone in Euclidean space E; 
we loose nothing by assuming that K linearly spans E. We assume, further, that || • || is a norm on 
E, and / : -ftT — )• R is a convex function with Lipschitz continuous gradient, so that 

\\f'{x)-f'{y)\U<Lf\\x-y\\ yx,y e K, 

where || • denotes the norm dual to || • ||, whence 

yx,yeK: f{y) < f{x) + {f'{x),y - x) + ^\\y - xf. (4) 
We consider two kinds of problems, detailed below. 

Norm-minimization. Such problems correspond to 

= min{||x|| : X £ K, f{x) < 0} . (5) 

X 

To tackle ^ , we consider the following parametric family of problems 

Opt(/3) = min{/(x) : ||x|| < p, x £ K} . (6) 
Note that whenever ([5]) is feasible, which we assume from now on, we have 

/9* = min{p > : Opt(p) < 0}, (7) 
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and both problems ([7]) can be solved. 

Given a tolerance e > 0, we want to find an e-solution to the problem, that is, a pair p^, £ K 
such that 

Pe < p* and Xe € Xp^ such that /(xe) < e, (8) 



where Xp := {x £ E : x £ K, \\x\\ < p}- Getting back to the problem of interest ([5]), is then 
"super-optimal" and e-feasible: 

\\Xe\\ < Pe< P*, f{Xe) < £• 

Penalized norm minimization. These problems write as 

Opt = min{/(x) + : x E K} . (9) 

X 

A equivalent formulation is 

Opt = min {F{[x; r]) = nr + f{x) : x G K, \\x\\ < r} . (10) 



We shall refer to (10) as the problem of composite optimization (CO). Given a tolerance e > 0, our 



goal is to find an e-solution to (10), defined as a feasible solution {x^,r^) to the problem satisfying 



F{[xe;re\) — Opt < e. Note that in this case is an e-solution, in the similar sense, to ([9]). 

Special case. In many applications where Problem ^ arise, ([9]) the function / enjoys a special 
structure: 

f{x) = (l)iAx-b), 

where x i— t- Ax — 6 is an affine mapping from E to R*", and : R™" — t- R is a convex function 
with Lipschitz continuous gradient; we shall refer to this situation as to special case. In such case, 
the quantity Lf can be bounded as follows. Let 7r(-) be some norm on R™, vr=i,(-) be the conjugate 
norm, and be the norm of the linear mapping x i— )■ Ax induced by the norms || • ||, 7r(-) on 

the argument and the image spaces: 

11-^11 IMI,'^(-) = max{7r(^a:;) : ||x|| < 1}. 

Let also )[(/)] be the Lipschitz constant of the gradient of (/) induced by the norm '7r(-), so that 

M<^'{y) - 0'(y')) < L^i-)[4>My - v) ^^'^ ^ F- 

Then, one can take as Lf the quantity 

% = L,(.)M||^||J.|I^^(.). (11) 

Example 1: quadratic fit. In many applications, we are interested in || • ||2-discrepancy between 
Ax and 6; the related choice of 0(-) is (j){y) = \lFy- Specifying 7r(-) as || • ||2, we get L||.||2[(/'] = 1. 
Example 2: smoothed £oo fit. When interested in II • ||oo discrepancy between Ax and 6, we can 
use as cj) the function (t){y) = ^Hyll^, where /3 G [2,oo). Taking 7r(-) • oo) ^6 get 

^||.||^M<(/3-l)m2//3. 
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Note that 



1 m^/l^ 




so that for /3 = 0(1) ln(m) and m large enough (specifically, such that /3 > 2), (j){y) is within absolute 
constant factor of ^Hyll^- The latter situation can be interpreted as behaving as ^|| • ||^). At 
the same time, with /3 = 0(l)ln(m), L||.||_^ [(/>] < 0(1) ln(m) grows with m logarithmically. 
Another widely used choice of <p{-) for this type of discrepancy is "logistic" function 

4>{y) = -\ry\Y,[eP^^+e-Py^ 

\i=l 

For 7r(-) = II • lloo we easily compute < /3 and ||y||oo < </>(y) < ||y||oo +ln(2n)//3. 

Note that in some applications we are interested in "one-sided" discrepancies quantifying the 
magnitude of the vector [Ax — 6]+ = [max[0, {Ax — ...;max[0, [Ax — rather than the 

the magnitude of the vector Ax — b itself. Here, instead of using = ^||y||^ in the context of 
examples 1 and 2, one can use the functions 4>+{y) = (j){[y]+). In this case the bounds on L^(^.-^[(j)j^] 
are exactly the same as the above bounds on L^(^.)[(j)]. The obvious substitute for the two-sided 
logistic function is its "one-sided version:" (f)-\-{y) = ^^T^{YliLi [^^^^ + -'-]) which obeys the same 
bound for L^^^.) [</>+] as its two-sided analogue. 

First-order and Linear Optimization oracles. We assume that / is represented by a first- 
order oracle - a routine which, given on input a point x E K, returns the value f{x) and the gradient 
f'{x) of / at X. As about K and || • ||, we assume that they arc given by a Linear Optimization 
(LO) oracle which, given on input a linear form (r/, •) on E, returns a minimizcr x[r]] of this linear 
form on the set {x e K : \\x\\ < 1}. We assume w.l.o.g. that for every rj, x[ri] is either zero, or 
is a vector of the || • ||-norm equal to 1. To ensure this property, it suffices to compute {r],x[r]]) 
for x[r]] given by the oracle; if this inner product is 0, we can reset x[ri] = 0, otherwise ||a;[?7]|| is 
automatically equal to 1. 

Note that an LO oracle for K and || • || allows to find a minimizer of a linear form of 2; = [x;r] G 
E'^ := E xHon a set of the form -f^"^[/9] = {[x; r] G E'^ : x e K, \\x\\ < r < p} due to the following 
observation: 

Lemma 1. Let p >Q and rj'^ = [rj; a] G E~^ . Consider the linear form i{z) = {r]'^,z) of z = [x; r] G 
E'^, and let 

p[x[r]];l] ,{r]+,[x[r]];l]) <0, 
, otherwise 



z = 



Then z~^ is a minimizer of £{z) over z G ii'"'"[/o], When a = 0, one has z~^ = p[x[r]]; 1]. 

Indeed, let = [x*;r*] be a minimizer of £{■) over Since ||a:*|| < r* due to [x*;r*] G 

K^[p], we have z* := r*[a;[r/]; 1] G K^[p] due to ||x[r/]|| < 1, and £*{z*) < i?(z*) due to the definition 
of x[r)]. We conclude that any minimizer of £{■) over the segment {s[x[77]; 1] : < s < p} is also a 
minimizer of £{■) over It remains to note that the vector indicated in Lemma clearly is a 

minimizer of £{■) on the above segment. □ 
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3 Conditional Gradient algorithm 



In this section, we present an overview of the properties of the standard Conditional Gradient 
algorithm, and highlight some memory-based extensions. These properties are not new. However, 
since they are key for the design of our proposed algorithms in the next sections, we present them 
for further reference. 



3.1 Conditional gradient algorithm 

Let ii^ be a Euclidean space and X be a closed and bounded convex set in E which linearly spans 
E. Assume that X is given by a LO oracle - a routine which, given on input r] £ E, returns an 
optimal solution xxiij] to the optimization problem 

min (f], x) 

(cf. Section [2]). Let / be a convex differentiable function on X with Lipschitz continuous gradient 
f'{x), so that 

yx,yeX: f{y) < f{x) + {f'{x),y - x) + iL||y - x\\j,, (12) 
where || • ||x is the norm on E with the unit ball X — X. We intend to solve the problem 

/*=min/(a;). (13) 

A generic CndG algorithm is a recurrence which builds iterates xt £ X, t = 1,2,..., in such a way 
that 

f{xt+i) < f{xt+i), (14) 



where 

xt+i = xt + jtlxt - Xt], where xf = xx[f'{xt)] and 7t = ^fr 
Basic implementations of a generic CndG algorithm are given by 



(a) xt+i = xt + jt[xt - Xt], 7i = t^, /^gN 

(b) xt+i £ Argmin^g^,^ /(x), Dt = [xt,xf]; 

in the sequel, we refer to them as CndGa and CndGb, respectively. As a byproduct of running 
generic CndG, after t steps we have at our disposal the quantities 

f*,k = min [/(xfc) + {f'{xk),x- x^)] = f{xk) - {f'{xk),Xk - xx[f'{xk)]), l<k<t, (17) 

which, by convexity of /, are lower bounds on Consequently, at the end of step t we have at 
our disposal a lower bound 

/,*:= max/,,fc</„t = l,2,... (18) 

l<k<t 

on f^. 

Finally, we define the approximate solution xt found in course of t = 1, 2, ... steps as the best - 
with the smallest value of / - of the points xi, ...,xt. Note that xt £ X. 

The following statement summarizes the well known properties of CndG (to make the presen- 
tation self-contained, we provide in Appendix the proof). 
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Theorem 1. For a generic CndG algorithm, in particular, for both CndGa, CndGb, we have 

2L 

f{xt)- U<f{xt)- h<—, t>2; (19) 

and 

/(Xf) -/:<—, t>5. (20) 
Some remarks regarding the conditional algorithm are in order. 

Certifying quality of approximate solutions. An attractive property of CndG is the 
presence of online lower bound /* on /=„ which certifies the theoretical rate of convergence of the 



algorithm, see (20). This accuracy certificate, first established in also provides a valuable 
stopping criterion when running the algorithm in practice. 

CndG algorithm with memory. When computing the next search point x^+i the simplest 
CndG algorithm CndGa only uses the latest answer xf = xx[f'{xt)] of the LO oracle. Meanwhile, 
algorithm CndGb can be modified to make use of information supplied by previous oracle calls; 
we refer to this modification as CndG with memory (CndGM)j^ Assume that we have already 
carried out t — 1 steps of the algorithm and have at our disposal current iterate xt G X (with xi 
selected as an arbitrary point of X) along with previous iterates Xr, t < t and the vectors f'{xr), 
x^ = xxif {xr)]- At the step, we compute f'{xt) and xf = xx[f'{xt)]- Thus, at this point in 
time we have at our disposal 2t points Xr,xf, I < t < t, which belong to X. Let Xt be subset 
of these points, with the only restriction that the points xt, xf are selected, and let us define the 
next iterate x^+i as 

xt+i G Argmin f{x), (21) 
xeConv(Xt) 

that is, 

xt+i = e Arg™i" I / • ^ - °' 5^ = 4 ■ ^^^^ 

xeXt >^^-{K}xeXt [ ^ ' xeXt ) 

Clearly, it is again a generic CndG algorithm, so that conclusions in Theorem [T] are fully applicable 
to CndGM. Note that CndGb per se is nothing but CndGM with Xt = {xt, xf} and M = 2 for all 
t. 

CndGM: implementation issues. Assume that the cardinalities of the sets Xt in CndGM are 
bounded by some M > 2. In this case, implementation of the method requires solving at every 



step an auxiliary problem ( 22 ) of minimizing over the standard simplex of dimension < M — 1 a 
smooth convex function given by a first-order oracle induced by the first-oracle for /. When M 
is a once for ever fixed small integer, the arithmetic cost of solving this problem within machine 
accuracy by, say, the Ellipsoid algorithm is dominated by the arithmetic cost of just 0(1) calls to 
the first-order oracle for /. Thus, CndGM with small M can be considered as implementabl^ 

^Note that in the context of "classical" Frank- Wolfe algorithm - minimization of a smooth function over a poly- 
hedral set - such modification is referred to as Restricted Simplicial Decomposition [131 1121 132j 



^Assuming possibility to solve ( 22 1 exactly, while being idealization, is basically as "tolerable" as the stan- 



dard in continuous optimization assumption that one can use exact real arithmetic or compute exactly eigenval- 
ues/eigenvectors of symmetric matrices. The outlined "real life" considerations can be replaced with rigorous error 
analysis which shows that in order to maintain the efflciency estimates from Theorem [l] it suffices to solve t-th 
auxiliary problem within properly selected positive inaccuracy, and this can be achieved in 0{ln[t)) computations of 
/ and /'. 
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Note that in the special case (Section [2]) , where f{x) = (j){Ax — b), assuming (/>(•) and </>'(•) 
easy to compute, as is the case in most of the apphcations, the first-order oracle for the auxiliary 
problems arising in CndGM becomes cheap (cf. [34] ) . Indeed, in this case (22) reads 



It follows that all we need to get a computationally cheap access to the first-order information on 
5(4 (A*) for all values of A* is to have at our disposal the matrix- vector products Ax, x £ Xt. With our 
construction of Xt, the only two "new" elements in Xt (those which were not available at preceding 
iterations) are xt and xf, so that the only two new matrix- vector products we need to compute 
at iteration t are Axt (which usually is a byproduct of computing f'{xt)) and Axf . Thus, we can 
say that the "computational overhead," as compared to computing f'{xt) and xf = xx[f'{xt)], 
needed to get easy access to the first-order information on gt{-) reduces to computing the single 
matrix- vector product Axf. 

4 Conditional gradient algorithm for parametric optimization 

In this section, we describe a multi-stage algorithm to solve the parametric optimization problem 
Q, ([7]), using the conditional algorithm to solve inner sub-problems. ([6]), ([T]). The idea, originating 
from [19] (see also \22 \ \16 \ IM]). is to use a Newton-type method for approximating from below the 
positive root of Opt{p), with (inexact) first-order information on Opt(-) yielded by approximate 
solving the optimization problems defining Opt(-); the difference with the outlined references is 
that now we solve these problems with the CndG algorithm. 

Our algorithm works stagewise. At the beginning of stage s = 1, 2, we have at hand a lower 
bound ps on p^:, with pi defined as follows: 

We compute /(O), /'(O) and x[/'(0)]. If /(O) < e or x[f'{0)] = 0, we are done — the pair 
(p = 0, X = 0) is an e-solution to ([T]) in the first case, and is an optimal solution to the 
problem in the second case (since in the latter case is a minimizer of / on K, and 
is feasible) . Assume from now on that the above options do not take place ( "nontrivial 
case"), and let 

d=-{no),x[f'm). 

Due to the origin of x[-], d is positive, and f{x) > /(O) + (/'(0),x) > /(O) — d\\x\\ for 
all X £ K, which implies that />* > pi := > 0. 

At stage s we apply a generic CndG algorithm (e.g., CndGa,CndGb, or CndGM; in the sequel, 
we refer to the algorithm we use as to CndG) to the auxiliary problem 

Opt(p,) = min{/(x) : x G K[ps]}, K[p] = {x e K : \\x\\ < p}, (23) 

X 

Note that the LO oracle for K, \\ ■ \\ induces an LO oracle for K[p\; specifically, for every rj £ E, 
the point Xp[r/] := px[r]] is a minimizer of the linear form {r),x) over x G K[p], see LemmajTj Xp[-] 
is exactly the LO oracle utilized by CndG as applied to ([23j). 
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As explained above, after t steps of CndG as applied to (23), the iterates being Xr S K[ps], 
1 < T < t 1^ we have at our disposal current approximate solution xt G {xi,...,xt} such that 
f{xt) = mini<T-<t f{xr) along with a lower bound /* on Opt{ps). Our policy is as follows. 

1. When f{xt) < e, we terminate the solution process and output p = Ps and x = xt] 

2. When the above option is not met and /* < we specify x^+i according to the descrip- 
tion of CndG and pass to step t + 1 of stage s; 

3. Finally, when neither one of the above options takes place, we terminate stage s and pass to 
stage s + 1, specifying ps+i as follows: 

We are in the situation f{xt) > e and /* > Now, ior k < t the quantities f{xk), 

f'{xk) and x[f'{xk)] define affine function of p > 

4(p) = f{xk) + {f'{xk),x - px[f'(xk)]). 

By Lemma [T] we have for every p > 

£k{p) = mm [f{xk) + {f'{xk),x-Xk)]< min /(x) = Opt(p), 

x&K[p] x&K[p] 

where the inequality is due to the convexity of /. Thus, ikip) is an afhne in p > lower 
bound on Opt(p), and we lose nothing by assuming that all these univariate affine functions 



are memorized when running CndG on (23). Note that by construction of the lower bound 
/* (see (17), (18) and take into account that we are in the case of X = K[ps], xx[v] = Psx[r]]) 



we have 



ft=£\ps), = max 4(p). 

l<k<t 



Note that i\p) is a lower bound on Opt{p), so that i^{p) <0 for p> p^:, while i^{ps) = ft is 
positive. It follows that 

r* := min{p : i\p) < O} 

is well defined and satisfies Ps < ^ p*- We compute r* (which is easy) and pass to stage 
s + 1, setting Ps+i = and selecting, as the first iterate of the new stage, any point known 
to belong to K[p] (e.g., the origin, or xt). The first iterate of the first stage is 0. 

The description of the algorithm is complete. 

The complexity properties of the algorithm are given by the following proposition. 

Theorem 2. When solving a PO problem ^ by the outlined algorithm, 

(i) the algorithm terminates with an e-solution, as defined in Section^ (cf. 

(ii) The number Ns of steps at every stage s of the method admits the bound 



iVo < max 



6, ^* ^ + 3 



(iii) The number of stages before termination does not exceed the quantity 

1 r . ^2 - 



max 



1.21„(M±iii^)+ 2.4,3 



''The iterates xt, same as other indexed by t quantities participating in the description of the algorithm, in fact 
depend on both t and the stage number s. To avoid cumbersome notation when speaking about a particular stage, 
we suppress s in the notation. 
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5 Conditional Gradient algorithm for Composite Optimization 



In this section, we present a modification of the CndG algorithm capable to solve composite mini- 



mization problem (10). We assume in the sequel that || • \\.,K are represented by an LO oracle for 



the set {x & K : \\x\\ < 1}, and / is given by a first order oracle. In order to apply CndG to the 



composite optimization problem (10), we make the assumption as follows: 



Assumption A: There exists D < oo such that Kr+f{x) < /(O) together with ||x|| < r, 
X G K, imply that r < D. 

We define D^, as the minimal value of D satisfying Assumption A, and assume that we have at our 
disposal a Enite upper bound on D^, . An important property of the algorithm we are about to 
develop is that its efficiency estimate depends on the induced by problem's data quantity D^, and 
is independent of our a priori upper bound on this quantity, see Theorem below. 



The algorithm. We are about to present an algorithm for solving (10). Let = E xH, and 

= {[x;r] : X G K, \\x\\ < r}. From now on, for a point z = [x;r] £ we set x{z) = x and 
r(z) = r. Given z = [x; r] £ K^, let us consider the segment 

A{z) = {p[x[f'ix)];l]: 0<p<D+}. 

and the linear form 

C = [e;r]^(/'(x),0+Kr=(F'(z),C) 

Observe that by Lemma [l| for every < p < D^, the minimum of this form on A'+[/9] = {[x; r] S 
E~^,x G K, \\x\\ < r < p} is attained at a point of A(z) (either at [px[f'{x)]; p] or at the origin). 
A generic Conditional Gradient algorithm for composite optimization (COCndG) is a recurrence 
which builds the points zt = [xt;rt] G , t = 1, 2, in such a way that 

zi = 0; F{zt+i) < mm{F{z) : z G Conv {A{zt) U {zt})}, t = 1,2, ... (24) 



Let 2:=K = [2;=i,;r=i,] be an optimal solution to (10) (which under Assumption A clearly exists), and 



let = F(z=k) (i.e., F^, is nothing but Opt, see (^). 



Theorem 3. A generic COCndG algorithm (24) maintains the inclusions zt G and is a 



descent algorithm: F{zt-^-l) < F{zt) for all t. Besides this, we have 

F{zt)-F,<^-^,t = 2,3,... (25) 

COCndG with memory. The simplest implementation of a generic COCndG algorithm is given 
by the recurrence 

zi=0; zt+i = [xt+i;rt+i]£Argmm{F{z): z G Conv (A(zt) U {zt})}, t = l,2,.... (26) 

z 

Denoting 'Zj- := D^[x[f'{xT-)]; 1], the recurrence can be written 



zt+i = Xtzt + IJ-tZt, where 

(Ai,/xt) G Argmin \F{Xzt + pzt) : A + /x<l, X>0,p>0}. 



(27) 
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As for the CndG algorithm in section [sj the recurrence (26) admits a version with memory 
COCndGM still obeying (24) and thus sartisfying the conclusion of Theoremjs] Specifically, assume 
that we already have built t iterates Zj- = [xr',rr] G K^, 1 < t < t, with zi = 0, along with the 
gradients f'{xr) and the points x[f'{xr)]- Then we have at our disposal a number of points from 
K^, namely, the iterates Zr, t < t, and the points Zr = D^[x[f'{xr)]; 1]. Let us select a subset Zt 
of the set {zrjZr, 1 < T < t}, with the only restriction that Zt contains the points zt,zt, and set 



Zt+i S Argmin F{z) 
zeCt 



Ct = Conv{{0} U Zt}}. 



(28) 



Since ztj'zt € Zt, we have Conv {A{zt) U {zf})} C Ct, whence the procedure we have outlined is 
an implementation of generic COCndG algorithm. Note that the basic COCndG algorithm is the 
particular case of the COCndGM corresponding to the case where Zt = {zt,zt} for all t. The 
discussion of implement ability of CndGM in section [3] fully applies to COCndGM. 

Let us outline several options which can be implemented in COCndGM; while preserving the 
theoretical efficiency estimates stated in Theorem [3] they can improve the practical performance of 
the algorithm. For the sake of definiteness, let us focus on the case of quadratic /: f{x) = ||^x— 
with Ker^ = {0}; extensions to a more general case are straightforward. 

A. We lose nothing (and potentially gain) when extending Ct in (|28[) to the conic hull 



c+ = {w=Y,k(- Ac > 0, C G Zt} 



of Zt- When K = E, we can go further and replace (28) with 



zt+i £ Argmin < 

z=[x;7'],\ 



f{x) + Kr:x= ^ A^r/, r > ^ |Ac|/> 



(29) 



Note that the preceding "conic case" is obtained from (29) by adding to the constraints of 



the right hand side problem the inequalities > 0,C G Zt. Finally, when || • || is easy to 
compute, we can improve (29) to 



Zt+l 



A* € Argmin 



{Accent} 



C=[v;p]eZt 



\k\p} 



(30) 



(the definition of A* assumes that K = E, otherwise the constraints of the problem specifying 
A* should be augmented by the inequalities A^ > 0, C £ -^t)- 
B. In the case of quadratic / and moderate cardinality of Zt, optimization problems arising in 



(29) (with or without added constraints A^ > 0) are explicitly given low-dimensional "nearly 
quadratic" convex problems which can be solved to high accuracy "in no time" by interior 
point solvers. With this in mind, we could solve these problems for the given value of the 
penalty parameter k and also for several other values of the parameter. Thus, at every it- 
eration we get feasible approximate solution to several instances of ^ for different values 
of the penalty parameter. Assume that we keep in memory, for every value of the penalty 
parameter in question, the best, in terms of the respective objective, of the related approxi- 
mate solutions found so far. Then upon termination we will have at our disposal, along with 
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the feasible approximate solution associated with the given value of the penalty parameter, 
provably obeying the efficiency estimates of Theorem [Sj a set of feasible approximate solutions 
to the instances of ^ corresponding to other values of the penalty. 

C. In the above description, Zt was assumed to be a subset of the set = {zr = [xr', r,-], 2^, 1 < 
T < t} containing Zf and Under the latter restriction, we lose nothing when allowing for Zt 
to contain points from K^\Z^ as well. For instance, when K = E and || • || is easy to compute, 
we can add to Zt the point z[ = [/'(xt); Assume, e.g., that we fix in advance the 

cardinality M > 3 oi Zt and define Zt as follows: to get Zt from Zt-i, we eliminate from the 
latter set several (the less, the better) points to get a set of cardinality < M — 3, and then add 
to the resulting set the points zt, % and z't- Eliminating the points according to the rule "first 
in - first out," the projection of the feasible set of the optimization problem in (30) onto the 
space of j;-variables will be a linear subspace of E containing, starting with step t = M, at 
least [M/3J (here [aj stands for the largest integer not larger than a) of gradients of / taken 
at the latest iterates, so that the method, modulo the influence of the penalty term, becomes 
a "truncated" version of the Conjugate Gradient algorithm for quadratic minimization. Due 
to nice convergence properties of Conjugate Gradient in the quadratic case, one can hope that 
a modification of this type will improve significantly the practical performance of COCndGM. 



6 Application examples 

In this section, we detail how the proposed conditional gradient algorithms apply to several exam- 
ples. In particular, we detail the corresponding LO oracles, and how one could implement these 
oracles efficiently. 

6.1 Regularization by nuclear/trace norm 

The first example where the proposed algorithms seem to be more attractive than the proximal 
methods are large-scale problems ([s]), ([o]) on the space of p x q matrices E = R^^'^ associated 
with the nuclear norm ||cr(x)||i of a matrix x, where a{x) = [ai{x); o-min[p^q] (a;)] is the vector of 
singular values of a p x q matrix x. Problems of this type with K = E arise in various versions of 
matrix completion, where the goal is to recover a matrix x from its noisy linear image y = Ax + ^, 
so that / = (j){Ax — y), with some smooth and convex discrepancy measure 4>{-), most notably, 
4>{z) = 2 II II 2- In this case, ||-|| minimization/penalization is aimed at getting a recovery of low rank 
(|3Il El HI 13 [13 [23 [23 [33 [23 129] and references therein). Another series of applications relates to 
the case when E = is the space of symmetric px p matrices, and K = S!j_ is the cone of positive 
semidefinite matrices, with / and (p as above; this setup corresponds to the situation when one 
wants to recover a covariance (and thus positive semidefinite symmetric) matrix from experimental 
data. Restricted from RP^p onto S^, the nuclear norm becomes the trace norm ||A(x)||i, where 
X{x) G RP is the vector of eigenvalues of a symmetric p x p matrix x, and regularization by this 
norm is, as above, aimed at building a low rank recovery. 

With the nuclear (or trace) norm in the role of || • ||, all known proximal algorithms require, at 
least in theory, computing at every iteration the complete singular value decomposition oi p x q 
matrix x (resp., complete eigenvalue decomposition of a symmetric pxp matrix x), which for large 
p, q may become prohibitively time consuming. In contrast to this, with K = E and || • || = ||o'(-)||i, 
LO oracle for {K, || • || = ||<t(-) || i) only requires computing the leading right singular vector eotapxq 
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matrix rj (i.e., the leading eigenvector of rj^'q): x[ri\ = —fe^, where e = e/||e||2 and / = r/e/||rye||2 for 
nonzero r/ and / = 0, e = when r] = 0. Computing the leading singular vector of a large matrix is, 
in most cases, much cheaper than computing the complete eigenvalue decomposition of the matrix. 
Similarly, in the case of E = S^, K = and the trace norm in the role of || • ||, LO oracle 
requires computing the leading eigenvector e of a matrix rj G S^: 2;[— r/] = ee^, where e = when 
e^rje > 0, and e = e/||e||2 otherwise. Here again, for a large symmetric p x p matrix, the required 
computation usually is much easier than computing the complete eigenvalue decomposition of such 
a matrix. As a result, in the situations under consideration, algorithms based on the LO oracle 
remain "practically implementable" in an essentially larger range of problem sizes than proximal 
methods. 

An additional attractive property of the CndG algorithms we have described stems from the fact 
that since in the situations in question the matrices x[r]] are of rank 1, t-th approximate solution 
xt yielded by the CndG algorithms for composite minimization from Sectionals of rank at most t. 
Similar statement holds true for t-th approximate solution xt built at a stage of a CndG algorithm 
for parametric optimization from Section [3[ provided that the first iterate at every stage is the zero 
matrix^ 

6.2 Regularization by Total Variation 

Given integer n > 2, consider the linear space := R"^". We interpret elements x of as 
images - real- valued functions x{i,j) on the n x n grid r„^„ = {[«; j]) G : < i,j < n}. The 
(anisotropic) Total Variation (TV) of an image x is the ^i-norm of its (discrete) gradient field 
(V.x(-),V,x(-)): 

TV(x) = ||Vix||i + \\Vjx\\i, 
Via;(i, j) = X(i + 1, j) - x{i,j) : Tn-i,n := {[i; j] G : < i < n - 1, < j < n}, 
Vjx{i,j) = X(i,j + 1) - x{i,j) : r„_„_i := {[i; j] G Z^ : < i < n, < j < n - 1} 

Note that TV(-) is a norm on the subspace Mq of M" comprised of zero mean images x (those 
with X^j j 2;(i, j) = 0) and vanishes on the orthogonal complement to Mq, comprised of constant 
images. 

Originating from the celebrated paper [28] and extremely popular Total Variation-based image 
reconstruction in its basic version recovers an image x from its noisy observation b = Ax + ^ by 
solving problems ([5]) or ([o]) with K = E = M", f{x) = (f>{Ax—b) and the seminorm TV(-) in the role 
of II • II . In the sequel, we focus on the versions of these problems where K = E = M" is replaced with 
K = E = Mq, thus bringing the Tl/- regularized problems into our framework. This restriction is 
basically harmless; for example, in the most popular case of f{x) = ^\\Ax — b\\2 reduction to the case 
of X G Mq is immediate ~ it suffices to replace {A, b) with {PA, Pb), where P is the orthoprojector 
onto the orthogonal complement to the one-dimensional subspace spanned by ^e, where e is the all- 
ones imag^ Now, large scale problems ^ with K = E = Mq and TV(-) in the role of || • || are 
difficult to solve by proximal algorithms. Indeed, in the situation in question a proximal algorithm 
would require at every iteration either minimizing function of the form TV(3;) + (e,x) +u!{x) over 

^this property is an immediate corollary of the fact that in the situation in question, by description of the algorithms 
Xt is a convex combination of t points of the form x[-]. 

®When / is more complicated, optimal adjustment of the mean t of the image reduces by bisection in t to solving 
small series of problems of the same structure as ([5|, Q where the mean of the image x is fixed and, consequently, 
the problems reduce to those with x G Mq by shifting b. 
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the entire E, or minimizing function of the form {e,x) + uj(x) on a TV-balQ where uj{x) is albeit 
simple, but nonlinear convex function (e.g., ||a;||2, or ||Vjx||2 + ||Vjx||2). Auxiliary problems of this 
type seem to be difficult in the large scale case, especially taking into account that when running 
a proximal algorithm we need to solve at least tens, and more realistically - hundreds of thenj^ 
In contrast to this, a LO oracle for the unit ball TV = {x £ Mq : TV(x) < 1} of the TV norm is 
relatively cheap computationally - it reduces to solving a specific maximum flow problem. It should 
be mentioned here that the relation between flow problems and TV-based denoising (problem Q 
with ^ = /) is well known and is utilized in many algorithms, see [9] and references therein. While 
we have no doubt that the simple fact stated Lemma [2] below is well-known, for reader convenience 
we present here in detail the reduction mechanism. 

Consider the network (the oriented graph) G with nodes G r„^„ and 2n{n — 1) arcs as 
follows: the first n(n — 1) arcs are of the form {[i + 0<i<n — l,0<j<n, the next 

n{n — 1) arcs are {[i; j + 1], [i; j]) , < i < n, < j < n — 1, and the remaining 2n(n — 1) arcs (let us 
call them backward arcs) are the inverses of the just defined 2n(n — 1) forward arcs. Let £ be the 
set of arcs of our network, and let us equip all the arcs with unit capacities. Let us treat vectors 
from E = Mq as vectors of external supplies for our network; note that the entries of these vectors 
sum to zero, as required from external supply. Now, given a nonzero vector ij G Mq, let us consider 
the network flow problem where we seek for the largest multiple sr] of r] which, considered as the 
vector of external supplies in our network, results in a feasible capacitated network flow problem. 
The problem in question reads 

= max{s : Pr = srj, < r < e} , (31) 



where P is the incidence matrix of our networlijj and e is the all-ones vector. Now, problem (31) 
clearly is feasible, and its feasible set is bounded due to 7^ 0, so that the problem is solvable. 
Due to its network structure, this LP program can be solved reasonably fast even in the large scale 
case (say, when n = 512 or n = 1024, which already is of interest for actual imaging). Further, an 



intelligent network flow solver as applied to (31) will return not only the optimal s = s* and the 



corresponding flow, but also the dual information, in particular, the optimal vector z of Lagrange 
multipliers for the linear equality constraints Pr — sr] = 0. Let z be obtained by subtracting from the 
entries of z their mean; since the entries of z are indexed by the nodes, z can be naturally interpreted 
as a zero mean image. It turns out that this image is nonzero, and the vector x[r]] = — f/TV(z) is 
nothing than a desired minimizer of (ry, •) on TV: 



Lemma 2. Let t] be a nonzero image with zero mean. Then (31) is solvable with positive optimal 



value, and the image x[q\, as defined above, is well defined and is a maximizer of (r/, •) on TV. 



^which one of these two options takes place depends on the type of the algorithm. 

^On a closest inspection, "complex geometry" of the TV-norm stems from the fact that after parameterizing a zero 
mean image by its discrete gradient field and treating this field {g = ViX, h = Vjx) as our new design variable, the unit 
ball of the TV- norm becomes the intersection of a simple set in the space of pairs {g, h) £ F = pj,("~i)><" x p{,"x("-i) 
(the l\ ball A given by \\g\\\ + ll/illi < 1) with a linear subspace P of F comprised of potential vector fields {f,g) - 
those which indeed are discrete gradient fields of images. Both dimension and codimension of P are of order of n^, 
which makes it difficult to minimize over An P nonlinear, even simple, convex functions, which is exactly what is 
needed in proximal methods. 

®that is, the rows of P are indexed by the nodes, the columns are indexed by the arcs, and in the column indexed 
by an arc 7 there are exactly two nonzero entries: entry 1 in the row indexed by the starting node of 7, and entry 
— 1 in the row indexed by the terminal node of 7. 
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Bounding Lf. When applying CndG algorithms to the TV-based problems ([5]), Q with E = Mq 
and f{x) = 4>{Ax — b), the efficiency estimates depend linearly on the associated quantity Lf, which, 
in turn, is readily given by the norm ||^||TV(-),7r(-) of the mapping x i— >• Ax, see the end of Section 
[2] Observe that in typical applications ^ is a simple operator (e.g., the discrete convolution), so 
that when restricting ourselves to the case when 7r(-) is || • II2 (quadratic fit), it is easy to find a 
tight upper bound on ||-^||||.||2,|H|2- convert this bound into an upper bound on ||^||tv{-),||-||2' ^® 
need to estimate the quantity 

Qn = max{||x||2 : x G Mo",TV(x) < 1}. 

X 

Bounding Qn is not a completely trivial question, and the answer is as follows: 

Proposition 1. Qn is nearly constant, specifically, Qn < 0(l)y^ln(n) with a properly selected 
absolute constant 0(1). 

Note that the result of Proposition [T] is in sharp contrast with one-dimensional case, where the 
natural analogy of Qn grows with n as ^/n. We do not know whether it is possible to replace 
in Proposition |l] 0(1) Y^ln(n) with 0(1), as suggested by Sobolev's inequalitie^^ Note that on 
inspection of the proof. Proposition extends to the case of d-dimensional, d > 2, images with zero 
mean, in which case Qn < C{d) with appropriately chosen C{d). 



7 Numerical examples 

We present here some very preliminary simulation results. 



7.1 CndG for parametric optimization: sparse matrix completion problem 

The goal of the first series of our experiments is to illustrate how the performance and requirements 
of CndG algorithm for parametric optimization, when applied to the matrix completion problem 
scale with problem size. Specifically, we apply the algorithm of Section [4] to the problem of 
nuclear norm minimization 

min ||(t(x)||i, subject to {Vij — Xij)"^ < 6, (32) 

where cr(x) is the singular spectrum of a p x g matrix x. In our experiments, the set Q of observed 
entries {i,j) G {1, x {1, q} of cardinality m <^ pq was selected at random. 



Note that the the implementation of the CndGM is especially simple for the problem (32) - 
at each method's iteration it requires solving a simple quadratic problem with dimension of the 
decision variable which does not exceed the iteration count. This allows to implement efficiently 



the "full memory" version of CndGM (CndG algorithms with memory) (21), (22), in which the 
set Xt contains xt and all the points x:^ for 1 < r < t. 

We compare the performance of CndGM algorithms and of a "memory less" version of the CndG. 
To this end we have conducted the following experiment: 



^"From the Sobolev embedding theorem it follows that for a smooth function f{x, y) on the unit square one has 
Il/||i2 ^ 0(l)||V/||i, llV/lli :— ll/^lli + provided that / has zero mean. Denoting by /" the restriction of 

the function onto a n x n regular grid in the square, we conclude that ||/''||2/TV(/") — >■ |l/|U2/ll^/lli — as 
n — >■ 00. Note that the convergence in question takes place only in the 2-dimensional case. 
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1. For matrix sizes p,q £ [1, 2, 4, 8, 16, 32] x 10^ we generate n = 10 sparse px q matrices y with 
density d = 0.1 of non-vanishing entries as follows: we generate p x r matrix U and q x r 
matrix V with independent Gaussian entries Uij ~ M{0,m~^), Vij ~ M{0,n~^), and a r x r 
diagonal matrix D = diag[di, ...,dr] with di drawn independently from a miiform distribution 
on [0, 1]. The non- vanishing entries of the sparse observation matrix y are obtained by sam- 
pling at random with probability d the entries of x* = UDV^, so that for every i,j, yij is, 
independently over set to x*j with probability d and to with probability 1 — d. Thus, 
the number of non-vanishing entries of y is approximately m = dpq. This procedure results 
in m ~ 10^ for the smallest matrices y (1000 x 1000), and in m ~ 10® for the largest matrices 
(32000 X 32000). 

2. We apply to parametric optimization problem (32 ) MATLAB implementations of the CndGM 
with memory parameter M = 1 ( "memoryless" CndG), CndGM with M = 5 and full memory 

CndGM. The parameter b of (32) is chosen to be 5 = 0.001||y||£ (here ||y||f = [y^i^ylA 

stands for the Frobenius norm of y\ The optimization algorithm is tuned to the relative 
accuracy e = 1/4, what means that it outputs an e-solution x to (32), in the sense of ([S]), 
with absolute accuracy e = 5e. 

For each algorithm (memoryless CndG, CndGM with memory M = 5 and full memory CndGM) 
we present in table [T] the average, over algorithm's runs on the (common for all algorithms) sample 
of n = 10 matrices y we have generated, 1) total number of iterations A^it necessary to produce an 
e-solution (it upper-bounds the rank of the resulting e-solutuion), 2) CPU time in seconds Tcpu and 
3) MATLAB memory usage in megabytes S'mem- This experiment was conducted on a Dell Latitude 
6430 laptop equipped with Intel Core i7-3720QM CPU@2.60GHz and 16GB of RAM. Because of 
high memory requirements in our implementation of the full memory CndGM, this method was 
unable to complete the computation for the two largest matrix sizes. 

We can make the following observation regarding the results summarized in table [Tj CndG 
algorithm with memory consistently outperforms the standard - memoryless - version of CndG. 
The full memory CndGM requires the smallest number of iteration to produce an e-solution, which 
is of the smallest rank, as a result. On the other hand, the memory requirements of the full 
memory CndGM become prohibitive (at least, for the computer we used for this experiment and 
MATLAB implementation of the memory heap) for large matrices. On the other hand, a CndGM 
with memory M = 5 appears to be a reasonable compromise in terms of numerical efficiency and 
memory demand. 



7.2 CndG for composite optimization: multi-class classification with nuclear- 
norm regularization 

We present here an empirical study of the CndG algorithm for composite optimization as applied 
to the machine learning problem of multi-class classification with nuclear-norm penalty. A brief 
description of the multi-class classification problem is as follows: we observe "feature vectors" 
€ R*^, each belonging to exactly one of p classes Ci, ...,Cp. Each S,i is augmented by its label 
yi G {l,...,j?} indicating to which class belongs. Our goal is to build a classifier capable to 
predict the class to which a new feature vector ^ belongs. This classifier is given hy a, p x q matrix 
X according to the following rule: given ^, we compute the p-dimensional vector and take, as 
the guessed class of ^, the index of the largest entry in this vector. 

In some cases (see [6l [lO]), when, for instance, one is dealing with a large number of classes. 
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Matrix size 
P X <? 


Memory-less CndG 


CndGM with memory M ^ 5 


Full memory CndG 


iVit 


Tcpu 


•S'mcm 


Nit 


Tcpu 




Nit 


Tcpu 


5'mciii 


1000 X 1000 


271.6 


9.35 


17.11 


149.7 


5.01 


17.63 


78.4 


4.71 


78.98 


1000 X 2000 


292.1 


12.14 


31.67 


162.8 


7.76 


32.57 


93.5 


10.89 


156.22 


2000 X 2000 


246.8 


17.01 


54.45 


139.1 


11.19 


61.57 


71.9 


13.31 


248.13 


2000 X 4000 


259.3 


33.94 


105.09 


152.3 


24.50 


120.22 


57.7 


25.54 


410.02 


4000 X 4000 


321.8 


79.20 


207.26 


162.9 


50.59 


235.59 


74.6 


93.22 


1014.7 


4000 X 8000 


360.1 


169.8 


399.16 


147.3 


88.81 


464.68 


63.3 


135.6 


1766.4 


8000 X 8000 


323.4 


302.8 


754.46 


111.8 


134.1 


905.98 


53.6 


191.3 


3061.5 


8000 X 16000 


324.1 


614.3 


1485.6 


118.2 


286.5 


1800.7 


50.5 


329.4 


5826.7 


16000 X 16000 


258.7 


995.4 


2898.5 


99.7 


495.5 


3577.8 


50.8 


595.2 


11696 


16000 X 32000 


276.7 


2572 


5721.7 


70.3 


859.2 


7109.0 


NA 


NA 


NA 


32000 X 32000 


305.4 


5028 


11352 


57.6 


2541 


14186 


NA 


NA 


NA 



Table 1: memoryless CndG vs. CndGM with memory M = 5 vs. fuh memory CndGM. A^it: total 
number of method iterations; Tcpu: CPU usage (sec), and Smem- memory usage (MB) reported by 
MATLAB. 



there are good reasons "to train the classifier" — to specify x given the training sample {(,i,yi) 
1 < i < N — as the optimal solution to the nuclear norm penalized minimization problem 



Opt(«;) = min Fk(x) := — Vlog < Vexp ((xj - xlj)^,) I +K||cr(x)||i, (33) 

i=l K £=1 J 

where xj is the i-th row in x. 

Below, we report on some experiments with this problem. Our goal was to compare two versions 



of CndG for composite minimization: the memoryless version defined in (24) and the version with 



memory defined in (28). To solve the corresponding sub-problems, we used the Center of Gravity 



method in the case of ( 24 ) and the Ellipsoid method in the case of ( 28 ) |22^ [21] . In the version 
with memory we set M = 5, as it appeared to be the best option from empirical evidence. We have 
considered the following datasets: 

1. Simulated data: for matrix of sizes p,q £ 10^ x {2*}^^-|^, we generate random matrices 

= USV, with p X p factor U, q x q factor V, and diagonal p x q factor S with random 
entries sampled, independently of each other, from M{0,p^^) (for U), M{0,q^^) (for V), 
and the uniform distribution on [0, 1] (for diagonal entries in S). We use N = 20q, with 
the feature vectors (,i,---,(,n sampled, independently of each other, from the distribution 
J\f{0,lq), and their labels yi being the indexes of the largest entries in the vectors x^,^i + £«, 
where G R^' were sampled, independently of each other and of ^i, .■■,£,n, from Af{0, ^Ip)- 
The regularization parameter k is set to lO^^Trf 

2. Real-vi^orld data: we follow a setting similar to We consider the Pascal ILSVRC2010 
ImageNet dataset and focus on the "Vertebrate-craniate" subset, yielding 1043 classes, with 
20 examples per class. The goal here is to train a multi-class classifier in order to be able 
to predict the class of each image (example) of the dataset. Each example is converted to 
a 65536-dimensional feature vector of unit ^i-norm using state-of-the-art visual descriptors 
known as Fisher vector representation [10]. To summarize, we have p = 1043, q = 65536, 
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Matrix size 
pxq 


Memory-less CndG 


CndGM with memory M = 5 


iVit 


Tcpu 


5'mcm 


iVit 


Tcpu 


5'mcm 


2000 X 2000 


172.9 


349.7 


134.4 


99.70 


125.1 


174.1 


4000 X 4000 


153.4 


1035 


541.8 


88.2 


575.2 


704.1 


8000 X 8000 


195.3 


2755 


2169 


120.4 


1284 


2819 


16000 X 16000 


230.2 


6585 


8901 


134.3 


3413 


11550 


32000 X 32000 


271.4 


26370 


30300 


140.4 


17340 


30500 


1043 X 65536 


183 


2101 


2087 


111 


925.34 


2709 



Table 2: memoryless CndG vs. CndGM with memory Af = 5. Nit- total number of method 
iterations; Tcpu: CPU usage (sec) reported by MATLAB. 



= 20860. We set the regularization parameter to k = 10~^, which was found to result in 
the best predictive performance as estimated by cross-validation, a standard procedure to set 
the hyper parameters in machine learning 

In both sets of experiments, the computations are terminated when the "e-optimality condi- 
tions" 

h{f'{xt))\\oo < K + e , . 

{nxt),xt) + 4<^{xt)h < e\\a{xt)\\i 

were met, where ||o"(-)||oo denotes the usual operator norm (the largest singular value). These 
conditions admit transparent interpretation as follows. For every x, the function 

(I^Kix) = f{x) + (/(x),x - x) + K\\a{x)\\i 



underestimates Fre(x), see (33), whence Opt(K') > f{x) — {f'{x),x) whenever k' > ||o'(/'(x))||oc 
Thus, whenever x = xt satisfies the first relation in ( p4| ), we have Opt(K + e) > f{xt) — {f'{xt),xt) 
whence 

F^{xt) - Opt{K + e) < {f'{xt),xt) + K\\a{xt)\\i. 



We see that (34) ensures that Fk^xi) — Opt(K + e) < e ||(j(xf)||i, which, for small e, is a reason- 
able substitute for the actually desired termination when Ffi{xt) — Opt(K) becomes small. In our 
experiments, we use e = 0.001. 

In table [2] for each algorithm (memoryless CndG, CndGM with memory M = 5) we present 
the average, over 20 collections of simulated data coming from 20 realizations of x^, of: 1) total 
number of iterations A^it necessary to produce an e-solution, 2) CPU time in seconds Tcpu- The last 
row of the table corresponds to the real-world data. Experiments were conducted on a Dell R905 
server equipped with four six-core AMD Opteron 2.80GHz CPUs and 64GB of RAM. A maximum 
of 32GB of RAM was used for the computations. 

We draw the following conclusions from table [T] CndG algorithm with memory routinely out- 
performs the standard - memoryless - version of CndG. However, there is a trade-off between the 
algorithm progress at each iteration and the computational load of each iteration. Note that, for 



large M, solving the sub-problem (28) can be challenging. 



18 



7.3 CndG for composite optimization: TV-regularized image reconstruction 



Here we report on experiments with COCndGM as applied to TV-regularized image reconstruction. 
Our problem of interest is of the form ([9]) with quadratic /, namely, the problem 

min ^^{x) ■= \ \\PAx - Pb\\l +kTV(x); (35) 



fix) 



for notation, see section 6.2 



Test problems. In our experiments, the mapping x i— )• Ax is defined as follows: we zero-pad x 
to extend it from r„^„ to get a finitely supported function on Z^, then convolve this function with 
a finitely supported kernel a(-), and restrict the result onto Tn,n- The observations b G M" were 
generated at random according to 

bij = {Ax)ij + <T||a;||oo6i, ~ -^(0' 1)' '^<h3 (36) 

with mutually independent ^jj. The relative noise intensity cr > 0, same as the convolution kernel 
a(-), are parameters of the setup of an experiment. 



The algorithm. We used the COCndG with memory, described in section [5| we implemented 
the options listed in A - C at the end of the section. Specifically, 

1. We use the updating rule (30) with Zt evolving in time exactly as explained in item C: 
the set Zt is obtained from Zt-i by adding the points zt = [a;^; TV(xt)], % = [a;[V/(xt)]; 1] 
and z[ = [V/(xt); TV(V/(xt))], and deleting from the resulting set, if necessary, some "old" 
points, selected according to the rule "first in - first out," to keep the cardinality of Zt not to 
exceed a given M > 3 (in our experiments we use M = 48). This scheme is initialized with 
Zo = 0, zi = [0;0]. 

2. We use every run of the algorithm to obtain a set of approximate solutions to (35) associated 
with various values of the penalty parameter k, as explained in B at the end of section [5j 
Precisely, when solving (35) for a given value of n (in the sequel, we refer to it as to the 
working value, denoted k^), we also compute approximate solutions Xk(k') to the problems 
with the values k' of the penalty, for n' = kj, with 7 running through a given finite subset 
G 9 1 of the positive ray. In our experiments, we used the 25-point grid G = {7 = 2^/'^}J^_-|^2- 

The LO oracle for the TV norm on Mq utilized in COCndGM was the one described in Lemma 
[2] the associated fiow problem (31 ) was solved by the commercial interior point LP solver mosekopt 
version 6 p]. Surprisingly, in our application this "general purpose" interior point LP solver was 
by orders of magnitude faster than all dedicated network flow algorithms we have tried, including 
simplex-type network versions of mosekopt and CPLEX. With our solver, it becomes possible to 
replace in (31 ) every pair of opposite to each other arcs with a single arc, passing from the bounds 
< r < e on the flows in the arcs to the bounds — e < r < e. 

The termination criterion we use relies upon the fact that in COCndGM the (nonnegative) 
objective decreases along the iterates: we terminate a run when the progress in terms of the 
objective becomes small, namely, when the condition 

(pnixt-i) - (p^ixt) < emax[</>K(xt_i),5(/)K(0)] 

is satisfied. Here e and S are small tolerances (we used e = 0.005 and 5 = 0.01). 
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Organization of the experiments. In each experiment we select a "true image" x* G M", a 
kernel a(-) and a (relative) noise intensity a. Then we generate a related observation b, thus ending 



up with a particular instance of (35). This instance is solved by the outlined algorithm for working 
values Kw of k taken from the set = {7 = 2^/^}^_^, with the initial working value, selected in 
pilot runs, of the penalty underestimating the best - resulting in the best recovery - penalty. 

As explained above, a run of COCndGM, the working value of the penalty being yields 25 



approximate solutions to (35) corresponding to k along the grid k„ ■ G. These sets are fragments 
of the grid G"^, with the ratio of the consecutive grid points 2^/^ w 1.19. For every approximate 
solution X we compute its combined relative error defined as 



v{x) 



x'llillx — 2;"||2||X — ^^^'^ 
||a;*||i||x*||2||a;*||c 



here x is the easily computable shift of x by a constant image satisfying \\Ax — h\\2 = \\PAx — Ph\\2. 
From run to run, we increase the working value of the penalty by the factor 2^/^, and terminate 
the experiment when in four consecutive runs there was no progress in the combined relative error 
of the best solution found so far. Our primary goals are (a) to quantify the performance of the 
COCndGM algorithm, and (b) to understand by which margin, in terms of (pui'), the "byproduct" 



approximate solutions yielded by the algorithm (those which were obtained when solving (35) with 
the working value of penalty different from k) are worse than the "direct" approximate solution 
obtained for the working value k of the penalty. 

Test instances and results. We present below the results of four experiments with two popular 
images; these results are fully consistent with those of other experiments we have conducted so far. 
The corresponding setups are presented in table [3] Table [4] summarizes the performance data. Our 
comments are as follows. 

• In accordance to the above observations, using "large" memory (with the cardinality of Zt 
allowed to be as large as 48) and processing "large" number (25) of penalty values at every 
step are basically costless: at an iteration, the single call to the LO oracle (which is a must 
for CndG) takes as much as 85% of the iteration time. 

• The COCndGM iteration count as presented in table [4] is surprisingly low for an algorithm 
with sublinear 0(l/t) convergence, and the running time of the algorithm appears quite 
tolerablcEE 

Seemingly, the instrumental factor here is that by reasons indicated in C, see the end of 
section [sj we include into Zt not only zt = [xt;TY{xt)] and % = [x[V f{xt)];l], but also 
z't = [V/(xt); TV(V/(xt))]. To illustrate the difference, this is what happens in experiment 
A with the lowest (0.125) working value of penalty. With the outlined implementation, the 
run takes 12 iterations (111 sec), with the ratio (l)i/s{xt)/(j)i/s{xi) reduced from 1 (t = 1) to 
0.036 (i = 12). When z'^ is not included into Zt, the termination criterion is not met even 
in 50 iterations (452 sec), the maximum iteration count we allow for a run, and in course of 
these 50 iterations the above ratio was reduced from 1 to 0.17, see plot e) on figure [!} 



^^For comparison: solving on the same platform problem (35 1 corresponding to Experiment A (256 x 256 image) 
by the state-of-the-art commercial interior point solver mosekopt 6.0 took as much as 3,727 sec, and this - for a 
single value of the penalty (there is no clear way to get from a single run approximate solutions for a set of values of 
the penalty in this case). 
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# 


Image 


n 


«(■) 


Cond(^M) 


a 


A 


lenna^ 


256 


f speciaK 'gaussian' ,7, 1)^ (7x7) 


^ 2.5e7 


0.05 


B 


cameraman'^ 


512 


f speciaK 'gaussian' ,7, 1) (7x7) 


^ 2.5e7 


0.05 


C 


lenna 


256 


f speciaK 'unsharp' ) (3x3) 


« 40 


0.15 


D 


cameraman 


512 


f speciaK 'unsharp' ) (3x3) 


^ 40 


0.40 



http://cn.wikipedia.org/wiki/Lenna "'"http://cn.wikipcdia.org/wiki/Camcra_operator 
http://www.mathworks.com/help/images/ref/fspecial.html 



Table 3: Setups of the experiments. 





Iterations per 
run 


CPU per run, 
sec 


CPU per 
iteration, sec 


# 


Image size 


Runs 


min 


mean 


max 


mean 


max 


mean 


A 


256x256 


6 


4 


9.00 


12 


83.4 


148.7 


8.3 


B 


512x512 


9 


4 


7.89 


11 


212.9 


318.2 


25.9 


C 


256x256 


6 


17 


17.17 


18 


189.7 


214.7 


10.3 


D 


512x512 


6 


16 


16.00 


16 


615.9 


768.3 


36.0 



Table 4: Performance of COCndGM; platform: T410 Lenovo laptop, Intel Core 17 M620 
CPU@2.67GHz, 8GB RAM. Flow solver: interior point method mosekopt 6.0 [Ij 



• An attractive feature of the proposed approach is the possibility to extract from a single 
run, the working value of the penalty being k„, suboptimal solutions Xk„{k) for a bunch of 
instances of ^ differing from each other by the values of the penalty k. The related question 
is, of course, how good, in terms of the objective (pni-), are the "byproduct" suboptimal 
solutions Xi^^{k) as compared to those obtained when k is the working value of the penalty. 
In our experiments, the "byproduct" solutions were pretty good, as can be seen from plots 
(a) - (c) on figure [T| where we see the upper and the lower envelopes of the values of 0^ at 
the approximate solutions obtained from different working values of the penalty. 

In spite of the fact that in our experiments the ratios k/kw could be as small as 1/8 and as 
large as 8, we see that these envelopes are pretty close to each other, and, as an additional 
bonus, are merely indistinguishable in a wide neighborhood of the best (resulting in the best 
recovery) value of the penalty (on the plots, this value is marked by asterisk). 

Finally, we remark that in experiments A, B, where the mapping A is heavily ill-conditioned (see 
tablejs]), TV regularization yields moderate (just about 25%) improvement in the combined relative 
recovery error as compared to the one of the trivial recovery ( "observations as they are" ) , in spite of 
the relatively low {a = 0.05) observation noise. In contrast to this, in the experiments C, D, where 
A is well-conditioned, TV regularization reduces the error by 80% in experiment C (d = 0.15) and 
by 72% in experiment D (o" = 0.4), see figure [2] 
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(a) (b) 




Figure 1: (a) ~ (d): lower and upper envelopes of (k)) : G G} vs. k, experiments A - D. 

Asterisks on the K-axes: penalties resulting in the smallest combined relative recovery errors, (e): values of 
4>i/s{xt) vs. iteration number t with included (asterisks) and not included (circles) into Z^. 
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C: True image C: Observations, a = 0.15 C: TV recovery, k, = 0.250 




D: True image D: Observations, a = 0.40 D: TV recovery, k = 0.00328 



Figure 2: Experiments C, D 
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8 Appendix 

8.1 Proof of Theorem [T] 

Define 



et = f{xt) - /*, At = max{f'{xt),xt - x) = {f'{xt),xt - xf) 



where x^ = xx[f'{xt)]. Denoting by a;* an optimal solution to (13) and invoking the definition of 
x^ and convexity of /, we have 



{f'i^t),xt - xt) < {f'{xt),x^ - xt) < f* - f{xt). 



(37) 



Observing that for a generic GC algorithm we have /(xj+i) < f{xt + "ftixf — xt)) and invoking 
( (l2| ), we have 

f{xt+i) < f{xt) + 7t{f'{xt),xt -Xt) + ^7?\\xt-xt\\l < f{xt)-lt{f{xt) - h) + \L^l (38) 



where the concluding < is due to (37). It follows that et+i < (1 — 7t)et + \L'^u whence 



i=l k=i+l 



i=l 



i=l 



k=i+l 



where, by convention, = 1. Noting that 111=2+1 (1 

1, t, we get 



2 ^ 
k+1' 



n 



fc-l _ iji+l) 
k=i+l k+l 



< 



i(i + l) 



^ {i + l)H{t + l) - (t + l) 



2Lt . X 1 



(39) 



what is (19). 

To prove (20), observe that setting = mini<fc<t A^, and invoking (17), (18) we clearly have 

fixt) - ft = rami<k<t[fixt) - f*,k] = mmi<k<t[f (xt) - f{xk) + A^] 
< mini<fc<( Afc = At 

(we have used the fact that f{xt) < f{xk), k < t,hy definition of xt). We see that in order to prove 
( pO] ), it suffices to prove that 

A R I 

(40) 



- 4.5L 

At < — , t = 5,6,.. 



To verify ( 40 ) , note that by the first inequality in ( 38 ) 



7feAfc = -fk{f'ixk),Xk - x^) <ek- efc+i + 



(41) 



Assuming t > 2 and summing up these inequalities over k varying from to =J^/2L to t (here Ja[ 
stands for the largest integer strictly smaller than a), we obtain 



IkAk < eto + h^Yl 

k=to k=to 
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and therefore 



k=to 



k=to 



(42) 



Observe that ^ 



k=to 



Ik 



2ELto(^ + ^ 2[ln(t + 1) - ln(to + 1)] > 21n(2) and ^ 



k=to 7fc 



4ELio(^ + ^)~^ - ^[*o^ ~ *~"^] - I§=^- Assuming t > 4 (so that to > 2) and substituting into 



42[) the bound (19) for et^ we obtain 



- 2L 
At < — + , 
t t{t 



Lit + 2) 



2)ln(2) 



< 4.5L(t - 2) 



as required in ( 40 ) 



□ 



8.2 Proof of Theorem [2] 

The proof, up to minor modifications, goes back to [T9], see also [221 [TB]; we provide it here to make 
the paper self-contained. W.l.o.g. we can assume that we are in the nontrivial case (see description 
of the algorithm). 

l". As it was explained when describing the method, whenever stage s takes place, we have 
[0 <]pi < Ps 1^ P*, and ps-i < Ps, provided s > 1. Therefore by the termination rule, the output 
p, X of the algorithm, if any, satisfies p < p*, f{x) < e. Thus, (i) holds true, provided that the 
algorithm does terminate. Thus, all we need is to verify (ii) and (iii). 

2". Let us prove (ii). Let s > 1 be such that stage s takes place. Setting X = K[ps], observe 
that X — X C {x ^ E : \\x\\ < 2ps}, whence || • || < 2ps|| • ||x, and therefore the relation ^ implies 



the validity of (12) with L = Ap'^Lj. Now, if stage s does not terminate in course of some number 
t steps, then, in the notation from the description of the algorithm, f{xt) > e and /* < |/(xj), 
whence f{xt) — fl > e/4. By Theorem [ijii, the latter is possible only when 4:.5L/[t — 2) > e/4. 
Thus, t < max 5,2 + 



72piL 



Taking into account that ps < p*, (ii) follows. 

3". Let us prove (iii). This statement is trivially true when the number of stages is 1. Assuming 
that it is not the case, let 5 > 1 be such that the stage 5' + 1 takes place. For every s = 1, 5, let 
ts be the last step of stage s, and let Ug, be what in the notation from the description of stage 
s was denoted f{xt^) and ^*°(/o). Thus, > e is an upper bound on Opt(/)s), is '■= (-^{Ps) is a lower 
bound on Opt(ps) satisfying > 3us/4, and ^'^(•) is a piecewise linear convex in p lower bound on 
Opt(/5), p > 0, and ps+i > Ps is the smallest positive root of Let also —Qs be a subgradient 

of £**(•) at Ps- Note that > due to ps+i > Ps combined with (.^{ps) > 0, £^{ps+i) = 0, and by 
the same reasons combined with convexity of £*(•) we have 



Ps+i- Ps>L/gs, (43) 



and, as we have seen. 



(a) Us > e, 

l<s<S^ { (b) us> Optips) >£s> lus, . (44) 

(c) is-gs{p- Ps) <Opt{p), p>0. 
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Assuming 1 < s < 5 and applying (43), we get ps — Ps-i > ^Us-i/ds-i, whence, invoking (44) 



Us-l > Opt(ps_i) > 4 +gs[ps-l - Ps] > "^Us + '^Us-l^^. 

4 4 gs-i 

The resulting inequality implies that ^ + ^ < |, whence uZlgl-i - (l/4)(4/3)^ = 4/9. It 
follows that 

V^s < {2/3y-^,/^, l<s<S. (45) 



Now, since the first iterate of the first stage is 0, we have ui < /(O), while (44) applied with s = 1 



implies that /(O) = Opt(O) > ii + pigi > pigi, whence uigi < /(0)/pi = d. Further, by (43) 



9s > ^s/{Ps+i — Ps) > Q^s/p* > I'Us/p*, where the concluding inequality is given by (44). We see 



that Usgs > ^Usl p* > /p*- This lower bound on Uggs combines with the bound uigi < d and 



with (45) to imply that 



e < V4/3(2/3)'"Vf^/0*> 1 < s < 5. 

Finally observe that by the definition of and due to the fact that \\x[f' 
case, we have 

1 



1 in the nontrivial 



< f{p.x[fm) < /(O) + P*(/'(0),x[/(0)]) + ^Lfpl = f{Q)-p,d+\Lfpl 



(we have used (M| and the definition of d), whence < /(O) + \Lfpl and therefore 



e < y374(2/3)^-y/(0) + \Lfpl l<s<S. 
Since this relation holds true for every S >! such that the stage S + 1 takes place, (iii) follows. □ 



8.3 Proof of Theorem |3] 

By definition of zt we have zt G for all t and -F(O) = -^(^i) > ^{^2) ^ •••) whence rt < for 
all t by Assumption A. Besides this, r^: < as well. Let now et = F{zt) — F^, zt = [xt^rt], and 
let zf = [xf ,rf\ be a minimizer, as given by Lemma [l| of the linear form {F'(zt),z) oi z G 
over the set ir"'"[r*] = {J2;;r] : x G ||2;|| < r < r*}. Recalling that F'[zt) = [f'{xt)]K\ and that 
rt ^ D^: < D~^, Lemma 111 implies that z^ G A(zf). By definition of z^ and convexity of F we have 

{[f'{xt);K],zt- z:^) = {f'{xt),xt-xt) + K{rt-rf) 
> {f'{xt),xt-x^)+K{rt-r^) 
= {F'{zt), zt-z,)>F{zt)-Fiz,) = et. 



Invoking (12), it follows that for < s < 1 one has 



F{zt + s{z+ -Zt)) < F{zi) + s{[f'{xty,^],zt -zt) + ^\\x{z+)-x{zt) 



Lfs^ 
t -^t)+ 2 

, r. ^2 



< F{zt)-s€t + lLfs\rt + D,) 



using that ||j;(zj^)|| < rf and ||x(2;*)|| < rt due to zf, zt G , and that rf <r^ < D^. By (24) we 
have 

F(zt+i) < min F{zt + s(z+ - zt)) < F{zt) + min {-set + \LfS^{rt + D^)H , 
0<s<l 0<s<l z J \ /J 
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and we arrive at the recurrence 



et+i < Q 



1,2,... 



(46) 



When t = 1, this recurrence, in view of zi = 0, imphes that 62 < \LfD^. Let us show by induction 
in t > 2 that 



et < et 



t + U 



t = 2,3,.. 



(47) 



thus completing the proof. We have already seen that (47) is valid for t = 2. Assuming that (|47j) 
holds true for t = A: > 2, we have < \LfDl and therefore ek+i < — §77752^1 by (46) combined 



with < Tfc < D*. Now, the function s — gj^jjyis'^ is nondecreasing on the segment < s < ALjDl 
which contains and < e^, whence 



< 



1 



< 



SLfDUk + 13) 
(A; + 14)2 



4 



< 



(fc + l) + 14' 



k + U 



1 



SLfDl 



SLfDl 
k + U 



so that (47) holds true for t = A; + 1. 



□ 



8.4 Proofs for Section |6] 



Proof of Lemma [2j As we have already explained, (31) is solvable, so that z is well defined. 
Denoting by (s*,r*) an optimal solution to (31) produced, along with z, by our solver, note that 
the characteristic property of z is the relation 

{s* ,r*) £ Argmax {s + {z, Pr — sij) : < r < e}. 



Since the column sums in P are zeros and t] is with zero sum of entries, the above characteristic 
property of z is preserved when passing from z to z, so that we may assume from the very beginning 
that z = z is a zero mean image. Now, P = [Q, —Q], where Q is the incidence matrix of the network 
obtained from G by eliminating backward arcs. Representing a flow r as [rf, ri,], where the blocks 
are comprised, respectively, of flows in the forward and backward arcs, and passing from r to 
p = rj — ri,, our characteristic property of z clearly implies the relation 



[s , P 



7 



rl) G Argmax {s + {z, Qp - sr]) : \\p\\oo < 1}- 

s,p 



It follows that 



(a) {z,r]) = l, 

(b) \\p*\\oo<l, 

(c) {Q*z)^ = 

(d) Qp* = s*r]. 



<o, p; = -i, 
= 0, p;g(-i,i) 

>0, pZ = l, 



for all forward arcs 7, 



(48) 
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a) imply that {Q*z,p' 



s*, while (48 



c) says that {Q*z,p*) = ||Q*z||i, and s* = 
a) z ^ 0, and thus 2; is a nonzero image with zero mean; recalling what Q 
1) entries in Q*z form VjZ, and the last n(n — 1) entries form VjZ, so that 



(|48}d) and (|48 
||Q*^||i. b7]48 
is, the first n(n — 

||Q*z||i = TV(2;). The gradient field of a nonzero image with zero mean cannot be identically 
zero, whence TV(2;) = = s* > 0. Thus x[7]] = —z/TY{z) = —z/s* is well defined and 

TV(x[r/]) = 1, while by (48 a) we have {x[r]],r]) = —1/s*. Finally, let x G TV, implying that Q*x 
is the concatenation of VjX and VjX and thus ||Q*x||i = TV(x) < 1. Invoking (48,6, d), we get 
— 1 < {Q*x,p*) = {x,Qp*) = s*{x,r]), whence {x,r]) > —1/s* = {x[r]],r]), meaning that x[rj] G TV 
is a minimizer of (??, x) over x £ TV. □ 



Proof of Proposition [T], In the sequel, for a real- valued function x defined on a finite set (e.g., 
for an image), ||x||p stands for the Lp norm of the function corresponding to the counting measure 
on the set (the mass of every point from the set is 1). Let us fix n and x G Mq with TV(x) < 1; 
we want to prove that 

\\x\\2 <C^/Mn) (49) 

with appropriately selected absolute constant C. 

1". Let © stand for addition, and - for substraction of integers modulo n; p (B q = {p + 
q) modn G {0, 1, n — 1} and similarly for pQq. Along with discrete partial derivatives VjX, Vjx, 
let us define their periodic versions VjX, Vjx: 

Vix{i,j) = x{i - x{i,j) : r„,„ R, Vjx{i,j) = x{i,j 1) - x{i,j) : r„,„ R, 

same as periodic Laplacian Ax: 

Ax = x{i,j) - ^ [x{i 1, j) + x{i + x{i,j 1) + x{i,j 1)] : Tn,n ^ R- 

For every j, < j < n, we have Y17=o "^i^ihj) = and Vix{i,j) = \7ix{i,j) for < i < n — 1, 
whence X^"Jo^ |Vj(x)| < 2 ^^Jg^ |Vix(i, j)| for every j, and thus ||Vjx||i < 2||Vix||i. Similarly, 
||Vjx||i < 2||Vjx||i, and we conclude that 

||Vix||i + ||Vjx||i < 2. (50) 

2*^. Now observe that for < i, j < n we have 

x{i,j) = x(i0l,j) + Vix(i0l,i) 
xii,j) = x{i®l,j)-Vix{i,j) 
x{i,j) = x{i,j Ql) + Vjx{i,j el) 
x{i,j) = x{i,j ®1) -Vjx{i,j) 

whence ^ 

Ax{i,j) = - Vix{ielJ)-Vix{iJ) + Vjx{i,jel)-Vjx{i,j) (51) 

Now consider the following linear mapping from M" x M" into M"": 

B[g,h]{i,j) = ^[g{iel,j)-g{i,j) + h{i,jel)-h{i,j)], [r,j] G r„,„. (52) 
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From this definition and (51) it follows that 

Ax = B[ViX,Vjx]. 



(53) 



3". Observe that B[g,h] always is an image with zero mean. Further, passing from images 
u £ M" to their 2D Discrete Fourier Transforms DFT[u]: 

DFT[n](p, q) = ^ u{r, s) exjp{-2m{pr + qs)/n}, [p; q] G r„,„, 

0<r,s<n 

we immediately see that every image u with zero mean is the periodic Laplacian of another, uniquely, 
defined, image X[u\ with zero mean, with X[u] given by 

[0, p = q = 

DFT[X[u]](p,g) = Y[u]{p,q) := | PFTMfag) ^ o/bj;g]Gr„„ ^ IP'^'il ^ '^n,n, ^^^^ 

D{p, q) = 1- l[cos{2Trp/n) + cos(27rg/n)], [p; q] £ r„,„. 



In particular, invoking (53), we get 

DFT[x] = Y[B[ViX,Vjx]]. 
By Parseval identity, ||DFT[x]||2 = n||x||2, whence 

\\x\\2 = n-^\\Y[B[ViX,Vjx]]\\2. 



Combining this observation with (50), we see that in order to prove (49), it suffices to check that 
(!) Whenever g,h £ M" are such that 

(g, h)£G:= {(5, h)GM^xM": \\g\\, + < 2}, 

we have 

\\Y[B[g,h]]\\2<nCy^ln(n). (55) 

4". A good news about (!) is that since y[i?[(7, /i]] is linear in {g,h), in order to justify (!), it 
suffices to prove that (55) holds true for the extreme point of G, i.e., (a) for pairs where h = and 
g is an image which is equal to 2 at some point of Tn,n and vanishes outside of this point, and (b) 
for pairs where g = and h is an image which is equal to 2 at some point of r„^„ and vanishes 
outside of this point. Task (b) clearly reduces to task (a) by swapping the coordinates i,j of points 
from r„^„, so that we may focus solely on task (a). Thus, assume that 5 is a cyclic shift of the 
image 26: 

1, [r,j] = [0;0] 



g{i,j) = 26{ier,jes), 5{i,j) 



0, [^;i]/[0;0] 

From ([52]) it follows that then B[g,0] is a cyclic shift of B[25,0], whence \BFT[B[g,0]]{p, q)\ = 
\BFT[B[26, 0]]{p, q)\ for aU [p; q] G r„,„, which, by implies that \Y[B[g, 0]]{p, q)\ = \Y[B[26, 0]]{p, q)\ 
for all [p;q] G r„^„. The bottom line is that all we need is to verify that (55) holds true for 
g = 26,h = 0, or, which is the same, that with 



y{p,q) 



[1 — exp{2Tnp/n}) 



2[l - i[cos(27rp/n) +cos(27rg/n)]] 



(56) 
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where the right hand side by definition is at p = g = 0, it holds 



n-l 



Cn:= Yl \y{p,q)\' <n^C^Hn). 

p,q=0 



Now, (56) makes sense for all [p;q] E (provided that we define the right hand side as zero at 
all points of where the denominator in (56) vanishes, that is, at all point where p, q are integer 
multiples of n) and defines y as a double-periodic, with periods n in p and in q, function of [p; q]. 
Therefore, setting m = Floor(n/2) > 1 and = {[p; G Z^ : —m < p, q < n — m}, we have 

C = V \v(v o)P = V |l-exp{27r^p/n}p 
" o,[.^e^ l,kw - U-osi2.p/n) + cos(2vr,/n)]p ' 

Setting p{p,q) = \/p^~+~q^, observe that when ^ [p',q] £ W, we have |1 — exp{2'inp / n}\ < 
Cin~^p{p, q) and 2[1 — ^[cos(27rp/n)+cos(27r(7/n)]] > C2n^^ p'^{p, q) with positive absolute constants 
Ci, C2, whence 

With appropriately selected absolute constant C3 we have 

/n 
r-Vdr = C3ln(n). 

U^[p;gjfcvy 



Thus, Cn < (Ci/C2)^C3n2 ln(n), meaning that and thus (|49|, holds true with C = \/C^Ci/C2. 

□ 



32 



